How to split a raw file or a data set into many external raw files


Splitting external raw/text/flat files into multiple files

Recently, I worked on a cybersecurity project that entailed processing a staggering number of raw text files about web traffic. Millions of rows had to be read and parsed to extract variable values.

The problem was complicated by the varying record composition. Each external raw file was a collection of records of different structures that required different parsing logic. Moreover, those heterogeneous records could not possibly belong to the same rectangular data tables with fixed sets of columns.

Solving the problem

To solve the problem, I decided to employ a "divide and conquer" strategy: split the external file into many files, each with a homogeneous structure, then parse them separately to create as many output SAS data sets.

My plan was to use a SAS DATA step to loop through the rows (records) of the external file, read each row, identify its type, and, based on that, write it to a corresponding output file. This is similar to how we would split a data set into many:

data CARS_ASIA CARS_EUROPE CARS_USA;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   output CARS_ASIA;
      when('Europe') output CARS_EUROPE;
      when('USA')    output CARS_USA;
   end;   
run;

But how do you switch between the output files? The idea came from SAS' Chris Hemedinger, who suggested using multiple FILE statements to redirect output to different external files.

Splitting an external raw file into many

As you know, you can use the PUT statement in a SAS DATA step to write a character string, or a combination of character strings and variable values, to an external file. That external file (the destination) is defined by a FILE statement. While only one destination can be active at a time (we can't write to multiple raw file destinations simultaneously), there is no restriction on the number of FILE statements in a DATA step; only the most recently executed FILE statement is in effect.

Whenever you define a new destination with a FILE statement, it supersedes the previous definition, and that is exactly what we need.
Let’s say we need to split an external raw file input_file.txt into several raw files based on the value of some field REC_TYPE indicative of the record type. Then we can achieve that as shown in the following code example:

filename inf  'c:\temp\input_file.txt';
filename out1 'c:\temp\traffic.txt';
filename out2 'c:\temp\system.txt';
filename out3 'c:\temp\threat.txt';
filename out4 'c:\temp\other.txt';
 
data _null_;
   infile inf;
   input REC_TYPE $10.;
   select(REC_TYPE);
      when('TRAFFIC') file out1;
      when('SYSTEM')  file out2;
      when('THREAT')  file out3;
      otherwise       file out4;
   end;
   put _infile_;
run;

In this code, the INPUT statement retrieves the value of REC_TYPE. You may not use it exactly as written, but the point is that you need to capture the field(s) of interest.

But the INPUT statement does something else behind the scenes. First, SAS creates an input buffer and loads the entire raw data line into it. The input buffer's value is assigned to the _INFILE_ automatic variable. Then the value of REC_TYPE is moved to the Program Data Vector (PDV).

After that INPUT statement, two variables are available in the DATA step: REC_TYPE and the automatic variable _INFILE_.

Depending on the value of the REC_TYPE variable, the SELECT block toggles the FILE definition among the four filerefs: out1, out2, out3, or out4.

Then the PUT statement writes the value of the _INFILE_ automatic variable to the output file selected in the SELECT block.
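To see the input buffer in action, here is a minimal sketch (reusing the inf fileref assigned above; the OBS= limit is just to keep the log short) that writes both the parsed field and the full record to the SAS log:

```sas
data _null_;
   infile inf obs=3;                 /* examine only the first 3 records    */
   input REC_TYPE $10.;              /* REC_TYPE is moved to the PDV...     */
   put 'Parsed field : ' REC_TYPE;
   put 'Entire record: ' _infile_;   /* ...yet _INFILE_ holds the full line */
run;
```

The log shows that even though the INPUT statement names only REC_TYPE, the whole record remains available through _INFILE_.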

Splitting a data set into several external files

A similar technique can be used to split a data table into several external raw files. Let's combine the two code samples above to demonstrate how you can split a data set into several external raw files:

filename outasi 'c:\temp\cars_asia.txt';
filename outeur 'c:\temp\cars_europe.txt';
filename outusa 'c:\temp\cars_usa.txt';
 
data _null_;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   file outasi;
      when('Europe') file outeur;
      when('USA')    file outusa;
   end;
   put (_all_)(=); 
run;

This code reads observations of the SASHELP.CARS data table and, depending on the value of the ORIGIN variable, the put (_all_)(=); statement outputs all the variables, in their physical order, as named values (VARIABLE_NAME=VARIABLE_VALUE pairs) to one of the three external raw files specified by their respective file references (outasi, outeur, or outusa).

NOTE: If you use the put _all_; statement instead, SAS will output all the variables, including the automatic variables _ERROR_ and _N_.
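The difference between the two PUT forms is easy to see in the log with a quick sketch:

```sas
data _null_;
   set SASHELP.CARS(obs=1 keep=make model);
   put (_all_)(=);   /* writes Make= and Model= only                */
   put _all_;        /* also writes the automatic _ERROR_= and _N_= */
run;
```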

You can modify this code to produce delimited files with full control over which variables to output, and in what order. For example, the following code sample produces three files with comma-separated values:

data _null_;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   file outasi dlm=',';
      when('Europe') file outeur dlm=',';
      when('USA')    file outusa dlm=',';
   end;
   put make model type origin msrp invoice; 
run;

You may use different delimiters for your output files. In addition, rather than using a mutually exclusive SELECT block, you may use different logic for redirecting your output to different external files.
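For instance, here is a hedged variation on the example above that swaps the SELECT block for IF-THEN/ELSE logic and gives each file its own delimiter (the price-based routing rule is made up purely for illustration; the filerefs are the ones defined earlier):

```sas
data _null_;
   set SASHELP.CARS;
   /* route by origin, with a price-based override, just to illustrate */
   if origin = 'USA'    then file outusa dlm=',';    /* comma-separated */
   else if msrp > 40000 then file outeur dlm='|';    /* pipe-separated  */
   else                      file outasi dlm='09'x;  /* tab-separated   */
   put make model type origin msrp invoice;
run;
```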

Bonus: How to zip your output files as you create them

For those readers who are patient enough to read to this point, here is another tip. As described in this series of blog posts by Chris Hemedinger, SAS can read external raw files directly from zipped files without unzipping them first, and can likewise write output raw files directly into zipped files. You just need to specify that in your FILENAME statement. For example:

UNIX/Linux

filename outusa ZIP '/sas/data/temp/cars_usa.txt.gz' GZIP;

Windows

filename outusa ZIP 'c:\temp\cars.zip' member='cars_usa.txt';
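Reading works the same way. As a sketch, assuming the cars.zip archive created by the Windows example above, you can echo a member's first few lines straight from the archive without unzipping it:

```sas
filename inzip ZIP 'c:\temp\cars.zip' member='cars_usa.txt';

data _null_;
   infile inzip obs=5;   /* read directly from the ZIP member */
   input;
   put _infile_;         /* echo the first 5 lines to the log */
run;
```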

Your turn

What is your experience with creating multiple external raw files? Could you please share with the rest of us?


About Author

Leonid Batkhan

Leonid Batkhan is a long-time SAS consultant and blogger. Currently, he is a Lead Applications Developer at F.N.B. Corporation. He holds a Ph.D. in Computer Science and Automatic Control Systems and has been a SAS user for more than 25 years. From 1995 to 2021 he worked as a Data Management and Business Intelligence consultant at SAS Institute. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

21 Comments

  1. I have a similar situation, where I need to break up a file with over 5.5 million records into smaller files for processing through a module on a z/OS mainframe. Lengths of records are highly variable, but there's an identifying number at the beginning of each, and the final digit of that ID is at byte 13 of each record. This file actually contains a sample of a larger dataset, and since it's large enough, I know that the number of records for each terminal digit will be approximately the same, around 560,000-590,000 records. So a simple split criterion in this case is one file for each terminal digit.

    I've taken a macro approach, using SAS CONNECT, which means I need to dynamically allocate the filenames before writing to them. Here, an active SAS CONNECT session is already established.

    The macro definition must be inside an rsubmit block, as it will be called in the server session. Note also that the count variable in the loop is set to begin at 0 rather than 1. This is so that each value of count corresponds to one of the unique ID's terminal digits (0-9).

    rsubmit;
     
       ************* EDIT THESE VARIABLES *************;
    * input source filename;
    %let myinfile='source.file.on.main.frame';
     
    * base name of the files to allocate, with no numeric suffix;
    %let filestem=username.src.slice;
     
    * number of slice files to allocate, minus 1;
    %let slicenum=9;
      ************** END OF VARIABLES TO EDIT **************;
     
    filename src &myinfile;
     
    %macro slicefile;
      %do count=0 %to &slicenum;
     
        %*allocate file;
        filename slice&&count "&filestem&&count" disp=(new,catlg) label=(,,,,expdt=99365) 
            space=(15643,(10000,2000),rlse) recfm=vb lrecl=15643 noprompt;
     
        %*write data for one terminal digit to the file;
        data _null_;
          infile src;
          file slice&&count;
          input @13 trm 1.;
          if trm = &count;
          put _infile_;
        run;
     
        %*clear filename to release lock on the file;
        filename slice&&count;
      %end; %* do loop;
     
    %mend slicefile;
    %slicefile
     
    endrsubmit;

    I'm less than happy that this must read the entire input file ten times, but I haven't yet had time to work out how to get it to write to a different file based on the terminal ID digit on a line by line basis during a single read of the input file.

    • Leonid Batkhan

      Hi Rebekah, you are in the right place. The code example under "Splitting an external raw file into many" section of this blog shows how we can dynamically allocate filenames before writing to them. Your macro can be implemented as follows:

      %macro slicefile;
         %do i=0 %to &slicenum;
            filename slice&i "&filestem&i" disp=(new,catlg) label=(,,,,expdt=99365) 
               space=(15643,(10000,2000),rlse) recfm=vb lrecl=15643 noprompt;
         %end;
       
         data _null_;
            infile src;
            input;
            select(char(_infile_,13));
            %do i=0 %to &slicenum;
               when("&i") file slice&i;
            %end;
            end;
            put _infile_;
         run;
       
         %do i=0 %to &slicenum;
            filename slice&i;
         %end;
      %mend slicefile;

      That is, you are going to read your source file only once. By the way, you don't need a double ampersand in slice&&count; one is enough, so I used slice&i. I also used a macro loop to generate the 10 when( ) statements that switch the active output file, but you could write the 10 when( ) statements out explicitly instead:

         when('0') file slice0;
         when('1') file slice1;
         /* ... */
         when('9') file slice9;
      

      Please let me know if this is what you were looking for.
      Best,
      Leonid

      • Hi Leonid,

        My intention was just to share a similar case, while also acknowledging the flaw present in this iteration of approaching the issue, so I can't exactly say it's what I was "looking for" -- but I sure am happy to see it! It works beautifully, and as a bonus has a pleasing elegance to it. It seems my use of a double ampersand where it's not needed is a mental holdover from typing &filestem&&count!

        Your solution also reshaped my thinking on a closely related piece of code I hope to get to soon, which will split files based on a user-defined number of lines rather than a value taken from the current line. The challenge is that the file being read in has an unknown number of lines, so there is no longer a predefined number of output files. I'll share once I've had a chance to work through it.

        Thanks again,
        Rebekah

          • Hi Leonid,

            My solution won't generalize to as many situations as yours, but is a good fit for my colleagues who are most likely to use it. I typically set things up so they can edit key macro variables in a clearly delineated section at the top of a SAS file, and then submit the entire file.

            It will take a variable format external file on a z/OS system, regardless of how large, and split it into files of a user-specified number of lines, up to a maximum of 999999 lines per file, allocating the files on the fly. From looking at this code, it may not be obvious how your previous response helped me with this. You know how when you're working your way iteratively through attempts to solve a problem, they tend to become progressively more complex? Your post reminded me that sometimes, it's more effective to refocus on simplifying your approach.

            Here's where that got me:

            rsubmit;
               ************* EDIT THESE VARIABLES *************;
            * input source filename;
            %let myinfile='source.file.on.main.frame';
             
            * base name of the files to allocate, with no numeric suffix;
            %let filestem=username.output.slice;
             
            * max number of records to write per file;
            %let maxrecs=600000;
             
              ************** END OF VARIABLES TO EDIT **************;
             
            data _null_;
              infile &myinfile;
              if _n_ eq 1 then do;
                stem = symget('filestem');
                * specifying 6. limits maxrecs to < 1 million records;
                maxrecs = input(symget('maxrecs'),6.);
                filenum = 0;
                retain stem maxrecs filenum;
              end;
             
              if (mod(_n_,maxrecs) eq 1) then do;
                outfile = cats("'",stem,filenum,"'");
                rc = filename('slice', outfile, 'disk','disp=(new,catlg) label=(,,,,expdt=99365) space=(15643,(10000,2000),rlse) recfm=vb lrecl=15643 noprompt');
                filenum = filenum + 1;
              end;
             
              input;
              file dummy filevar=outfile;
              put _infile_;
            run;
             
            filename _all_ clear;
            endrsubmit;

            Thanks again,

            Rebekah

            • Leonid Batkhan

              Thank you, Rebekah, for sharing your code iteration. Very impressive and efficient use of FILENAME() function and FILEVAR= option of FILE statement!

              One area which could be improved on (at least stylistically) is the IF _n_=1 THEN DO; ... END;
              First of all, the RETAIN statement is a declarative (non-executable) statement and should not be placed in a conditional DO-group. It takes effect regardless of whether the condition is true or false, right after compilation and before the DATA step iterates. It behaves similarly to SAS global statements (see my recent post How to conditionally execute SAS global statements). Second, you can assign initial values in the RETAIN statement itself, so instead of the whole IF-THEN-DO construct you can make it a one-liner:
              retain stem "&filestem" maxrecs &maxrecs filenum 0;

              Also, in your outfile = cats("'",stem,filenum,"'"); you don't need those "'" pieces; you can write it simply as outfile = cats(stem,filenum); as you don't need the single quotation marks to be part of the outfile variable value.

              Will it work for you?
              Best regards,
              Leonid

              • Hi Leonid,

                Thank you! I'm in the (sub-optimal) habit of using an IF _N_ = 1 block to visually corral anything that isn't meant to execute with each iteration of the data step, whether declarative or executable. I hadn't even considered initializing those variables in the retain statement, and will definitely be adding that to my arsenal.

                As far as the single quotes in the outfile variable, I don't have those there for SAS but for z/OS, to indicate that it's a fully qualified path and prevent z/OS from prefixing the variable value with the username of the person running the program. It's just another of the vagaries of programming in a local SAS session on Windows, to connect to and execute code in a remote SAS session on the mainframe.

                I'm going to check out your post on conditionally executing global statements; I'm very curious to know if your method resembles mine. I have code that conditionally specifies which external code to include. The conditional logic assigns the appropriate path to the fileref nextsas, and after the code block is closed out, the include statement calls whichever file's path is assigned to nextsas. I haven't run into another circumstance where I need to do that, and based on what I've seen of your code so far, I expect you've got more examples.

                Best,
                Rebekah

  2. Hi Leonid,

    I've been looking for something like this, but I'm wondering how I can amend your code for my needs.

    Basically, I have a .json file with 6.7 million records. Currently it is too large for the SAS JSON mapper, therefore I need to split it out.

    filename json3 '\\C\Rec_20190402\total_json.json';
    filename out1 '\\C\Rec_20190402\paye_1.json';
    filename out2 '\\C\Rec_20190402\paye_1.json';
    filename out3 '\\C\Rec_20190402\paye_1.json';

    data _null_;
    infile json3;
    if _n_ ge 500000 then output out1;
    ****and so on
    put _infile_;
    run;

    Basically, I would like to output the large file into smaller ones based on the value of _N_.

    I'm having trouble trying to get working code though and would appreciate some advice.

    • Leonid Batkhan

      Hi Sean, I see several places where your code went wrong. First, you are missing an INPUT statement that reads a record; second, you use the OUTPUT statement, which is for data sets; for files you need the FILE statement. Your output files are also all defined identically, and you don't provide your IF-THEN logic ("****and so on" is not enough). The devil is in the details. Here is how I would arrange your code:

      filename json3 'C:\Rec_20190402\total_json.json';
      filename out1 'C:\Rec_20190402\paye_1.json';
      filename out2 'C:\Rec_20190402\paye_2.json';
      filename out3 'C:\Rec_20190402\paye_3.json';
      
      data _null_;
         infile json3;
         input;
         if _n_ le  500000 then file out1; else
         if _n_ le 1000000 then file out2; else file out3;
         put _infile_;
      run;
      

      Adjust it as needed. Please let me know if it solves your problem.

  3. Great post. Re "_INFILE_ is available immediately after any INPUT statement is executed so no need for @ to hang onto the record": I would agree. There are examples of it in Howard Schreier's paper https://www.lexjansen.com/nesug/nesug01/cc/cc4018bw.pdf. I did a look through the docs as well. The closest I could find is this walkthrough of how the DATA step works: https://documentation.sas.com/?docsetId=lrcon&docsetTarget=p18vk5t9cwort1n18g7zg2no6tr4.htm&docsetVersion=9.4&locale=en. I think the key point is that any INPUT statement reads a record into the input buffer referenced by _INFILE_ (even a null input;), and that the input buffer is not cleared automatically. It is retained even across iterations of the DATA step loop. The only time the content of the input buffer changes is when new data is read into it by another execution of an INPUT statement (assuming there was no trailing @/@@ on the prior INPUT statement). As I see it.

    • Leonid Batkhan

      Hi Quentin,
      Thank you for your comment and the references.
      1) In his paper, Howard Schreier uses _INFILE_ automatic variable after the following statement:
      input @;
      That is not after "any" INPUT statement. This INPUT statement does not specify any variables to read, and that is exactly how I understood it and used it in this blog post as the second (input;) statement (I did not need the trailing @ as there were no subsequent INPUT statements).
      2) Your second reference - SAS documentation does explain when input buffer is created: “If the input file contains raw data (as in the example above), SAS creates an input buffer to hold the data before moving the data to the program data vector (PDV).” Then “SAS reads the first data line into the input buffer.” However, the doc refers to the example of multiple variables specified in the INPUT statement where the whole record indeed needs to be read.
      What is counterintuitive (at least to me) is that even when you read just a single (even first) character, for example using
      input @1 firstchar $1;
      SAS will still read the whole record of up to 32KB long, load it into the record input buffer, and subsequently assign that whole record value to _INFILE_.
      I believe that the confusion could be dispelled by explicitly stating that any INPUT statement not following another INPUT statement with trailing @ or @@, always loads the entire data row into the input buffer and assigns its value to the _INFILE_ automatic variable, regardless of whether any variables or how many of them are specified with that INPUT statement. If an INPUT statement follows another INPUT statement with trailing @ or @@, then it skips reading the data row into the input buffer and updating the _INFILE_ automatic variable, and proceeds right to moving the data from the input buffer to the PDV.

      • I like your proposed addition to the documentation. Actually, it got me wondering about the LRECL option. There isn't a lot of documentation on that either. : ) Maybe one part of it is that the LRECL option (whether system option or option on the infile statement) determines the length of the input buffer (I think, I'm not sure!) and that determines the amount of data read into the input buffer by the INPUT statement. So even if the INPUT statement reads no variables into the PDV, it will still read LRECL characters into the input buffer and assign that value to _INFILE_.

        • Leonid Batkhan

          Thank you, Quentin. It seems logical that LRECL= determines the size of the input buffer, although since it is not stated explicitly in the documentation it is open to interpretation. As to how that LRECL value affects performance I will do some benchmarking.

  4. Peter Lancashire

    I know this is a blog about SAS, but I would question whether SAS is the right tool for this. The ancient unix awk tool is designed to do exactly this. Something like this one-line script would do what you want:
    print >> $1
    .
    Almost any structured text file editing is hard in SAS. I often use awk. The program file is 152kB in Windows. The GNU version is a bit bigger.
    .
    I imagine there is a clean and efficient way to do this in python too, but at the cost of more code bloat.

    • Leonid Batkhan

      Hi Peter. Thank you for your comment (non-SAS views are welcome here too). While this particular task (splitting text files into many) can be solved by a variety of different tools, it is just a part of a bigger picture where the power of SAS shines.

      By the way, your one-line script example is a bit of an exaggeration; it does not do what is needed here, as it lacks the criteria for splitting the source file as well as the specification of the output files.

      In SAS we can run any OS command using the X statement (x 'any-operating-system-command';) without leaving the SAS environment, which is no more complicated than running that command directly in a command window. However, because some IT policies prohibit using OS commands in SAS, we also rely on other ways of doing the same tasks.

      SAS is a great integrator tool; it allows transparent access to multiple heterogeneous data sources, as well as combining different languages into one automated process. You can run R or Python from within SAS as well.

      For the project I describe in this blog post, SAS is definitely the right tool for an automated process that includes reading external files, parsing them, validating them, and producing SAS tables that are loaded into SAS Visual Analytics for further analysis and modeling. I am just illustrating how a seemingly non-SAS task can be solved in SAS to ensure a smooth flow of execution.

      I might surprise you, but we can use SAS to create UNIX scripts too 🙂 .

  5. Hi Leonid,
    I'm what you call a lazy programmer. Maintenance is boring and much disliked. Have you ever used the FILEVAR= option on the FILE statement? In this case, doing so allows REC_TYPE to be any value, and you won't have to change your program as REC_TYPE values change. The MOD option says to append to the file if it exists, so you may have to remove the created text files if you want fresh ones for each run. Additionally, _INFILE_ is available immediately after any INPUT statement is executed, so there's no need for @ to hang onto the record. Try this.

    %let Folder=C:\Temp;
    filename inf  "&Folder.\input_file.txt";
     
    data _null_;
       infile inf;
       input REC_TYPE $10. ;
       whichFile="&Folder.\"||strip(REC_TYPE)||".txt";
       file dummy filevar=whichFile MOD;
       put _infile_;
    run;
    

    • Leonid Batkhan

      Hi Jerry,
      Thank you for stopping by and providing an alternative, data-driven way of doing it. The FILEVAR= usage is described in the SAS Support Sample 24599: Create multiple files from one data set. There are a couple of considerations. According to that Sample, "When creating multiple files with the FILEVAR= option, the files must have the same layout. When the value of the FILEVAR= variable changes, SAS closes the current file and opens a new file." That is why you need MOD. If you use multiple FILE statements as described in this blog post, the files remain open and you don't need MOD; you just switch between them for writing, and you don't have to remove them to start fresh between runs. Also, in your code example you use REC_TYPE both as the record type identifier and as the output file name. In the more general case of the two being different, you would end up with a SELECT-END block (or its alternative) for re-coding REC_TYPE into a file name. So, all in all, "lazy" programming may turn out to be more "expensive".

      Regarding your second point, that "_INFILE_ is available immediately after any INPUT statement is executed so no need for @ to hang onto the record": it is somewhat counterintuitive, in the sense that even when the INPUT statement specifies reading just the single first character from the raw file, SAS reads in the entire row regardless. Is that efficient? But it seems to work that way nevertheless. That also means that the second INPUT statement is redundant. I will double-check on this, and if it indeed works without limitations, this "hack" alone is worth a fortune. Thank you for sharing!

      • Hi Leonid,
        I think Sample 24599 can mislead when it says "the files must have the same layout" - it really means little more than that you can't change other file specification options (such as RECFM); it doesn't mean you can't arrange data elements in different orders, create repeating groups, or anything like that. Using FILEVAR= has the advantage that you don't need to know the 'rec-types' or the number of files being created beforehand, as you would with constructs like SELECT ... WHEN ...
        In any case, if you're doing little else than PUTting the _INFILE_ variable to the output files, they'd effectively have the same 'layout'.
        I've used this many times to implement data-driven ETL.

        • Leonid Batkhan

          Hi David,
          Thank you for your comment. I agree with you that the FILEVAR= option has an advantage in those cases where we don't know the number of files being created, and that is where a data-driven approach is totally appropriate. However, I dare say that "data-driven" is not always the best method. For the use case I describe in this blog post, where an external file is split based on the record type and where the output files are known in advance, I prefer not to use a data-driven implementation, for better code clarity and readability. I have also not seen any written material on the usage of multiple FILE statements in a SAS DATA step and wanted to shed light on this method in this post.

    • Leonid Batkhan

      Hi Jerry, based on your input I revised my original code

      data _null_;
         infile inf;
         input REC_TYPE $10. @;
         input;
         select(REC_TYPE);
         . . .
      

      to eliminate the trailing @ and consequently the second input statement:

      data _null_;
         infile inf;
         input REC_TYPE $10.;
         select(REC_TYPE);
         . . .
      

      and provided corresponding explanation.

      Thank you again for your INPUT (pun is coincidentally intended 🙂 .)
